Identification of pharmacogenetic variants from large scale next generation sequencing data in the Saudi population

It is well documented that drug responses are related to Absorption, Distribution, Metabolism, and Excretion (ADME) characteristics of individual patients. Several studies have identified genetic variability in pharmacogenes, that are either directly responsible for or are associated with ADME, giving rise to individualized treatments. Our objective was to provide a comprehensive overview of pharmacogenetic variation in the Saudi population. We mined next generation sequencing (NGS) data from 11,889 unrelated Saudi nationals, to determine the presence and frequencies of known functional SNP variants in 8 clinically relevant pharmacogenes (CYP2C9, CYP2C19, CYP3A5, CYP4F2, VKORC1, DPYD, TPMT and NUDT15), recommended by the Clinical Pharmacogenetics Implementation Consortium (CPIC), and collectively identified 82 such star alleles. Functionally significant pharmacogenetic variants were prevalent especially in CYP genes (excluding CYP3A5), with 10–44.4% of variants predicted to be inactive or to have decreased activity. In CYP3A5, inactive alleles (87.5%) were the most common. Only 1.8%, 0.7% and 0.7% of NUDT15, TPMT and DPYD variants respectively, were predicted to affect gene activity. In contrast, VKORC1 was found functionally, to be highly polymorphic with 53.7% of Saudi individuals harboring variants predicted to result in decreased activity and 31.3% having variants leading to increased metabolic activity. Furthermore, among the 8 pharmacogenes studied, we detected six rare variants with an aggregated frequency of 1.1%, that among several other ethnicities, were uniquely found in Saudi population. Similarly, within our cohort, the 8 pharmacogenes yielded forty-six novel variants predicted to be deleterious. Based upon our findings, 99.2% of individuals from the Saudi population carry at least one actionable pharmacogenetic variant.

Introduction Pharmacogenomics (PGx) studies genetic variations in an individual's drug metabolizing enzymes, associating these with adverse drug events or the level of drug response [1]. Drug efficacy and toxicity may be predicted from the genetic background of individuals, particularly in respect of the Cytochrome P450 (CYP) family of liver enzymes [2,3]. These enzymes catalyze the conversion of substances that are metabolized by our bodies, including pharmaceuticals [4]. Overall, the efficacy of a drug is related to its Absorption, Distribution, Metabolism and Excretion (ADME) [3,5]. The efficacy of a drug may also be associated with drug target polymorphisms. Drug targets can include receptors, enzymes and membrane transporters [2]. CYPs are responsible for deactivation of many drugs through direct metabolic activity or via facilitation of excretion and thus play a central role in ADME related efficacy. CYPs are also important in enzymatic conversion of some drugs from their native to bioactive forms [6].
These differences in drug metabolism highlight the current trend towards individualized pharmacotherapy, such that the right drug is delivered at the right dose to the right patient. A standard dose of a given drug is not always safe, effective or economical in an individual patient [7,8]. The high incidence of adverse drug events (ADEs) represents a heavy burden for the US health care system. Almost 7 million emergency department visits are related to ADEs each year with an estimated cost of $3.5 billion annually [9]. Large scale genomic studies provide opportunities to associate drug responses with individual pharmacogenetic profiles. Such knowledge may improve drug efficacy, result in better outcomes, and in some instances, prevents life-threatening adverse drug events. Dosing no longer needs to be based on the average drug responses of a patient population but can be personalized, taking into consideration individual pharmacogenomic and environmental variation. There are well-established drug-gene interactions, that include but are not limited to, clopidogrel (CYP2C19), warfarin (CYP2C9, VKORC1 and CYP4F2), thiopurines (TPMT, NUDT15), tacrolimus (CYP3A5) and fluorouracil (DPYD) [10][11][12][13][14]. These medications are commonly used globally with Saudi Arabia being no exception [9,15]. Protocols targeting use of the right drug at the right dose in the right person, based on genomic data to personalize treatment, have already been clinically implemented successfully in other countries, e.g. the RIGHT protocol in the US and U-PGx project in Austria, Spain, Great Britain, Greece, Italy, Netherlands and Slovenia [16][17][18]. The allelic frequencies of genes encoding drug-metabolizing enzymes and their phenotypic consequences may vary considerably between ethnic groups. The impact of these allelic variants has been well studied in Caucasians and some other ethnicities, yet poorly in Arabs [5,19]. This study expands pharmacogenetic knowledge of the Arab population. Description of the allelic spectrum of pharmacogenes, both known and novel variants, their frequency, and phenotypic designation in Saudi nationals, will provide a basis for better clinical management in this population.
During the last decade technological advances have enabled comprehensive mapping of human pharmacogenes [20,21]. Next Generation Sequencing (NGS) and High Performance Computing (HPC) are two technologies that have enhanced this field [22,23]. The mining of variants using sequence data from population-based genome programs, provide an opportunity to characterize the pharmacogenomic profiles of each of these groups. Here we describe our findings from the Saudi population.

Results
Mining of NGS data from a total of 11,889 (1,928 PGx gene panels and of 9,961 exomes) unrelated individuals was used to impute allele and haplotype frequencies. We analyzed frequencies of 82 haplotypes distributed across 8 pharmacogenes (Table 1). Nineteen CYP2C9 variants ( � 2, were found in CYP2C19 of which � 17 and � 2 were the most common: 25.9% and 9.6%, respectively. A splice site variant (rs7767746) that is the core allele for CYP3A5 � 3 was present in 84.7% of the population. Three other alleles ( � 6, � 7 and � 8) in CYP3A5 showed MAFs from <0.1% to 2.4%. In CYP4F2, 44.4% of Saudi individuals harbor a � 3 allele, the remaining population being wild type. We detected four VKORC1 alleles; the most common was VKORC1 � 2 (MAF = 53.7%) followed by rs7294, 3730G>A (MAF = 29.2%). Two other VKORC1 variants: 106G>T (rs61742245) and 196G>A (rs72547529) were less commonly observed with MAFs of 2.1% and <0.1%, respectively. Genetic polymorphisms in DPYD and TPMT were rare in the Saudi population. We identified eight variants (rs67376798, rs3918290, rs1801266, rs115232898, rs112766203.1, rs72549304, rs146356975, rs56038477) for DPYD and ten star alleles for TPMT although the overall MAFs for both these were low, 0.7% (DPYD) and 0.9% (TPMT). Two alleles ( � 3 and � 5) were identified in NUDT15. The � 3 allele was present with a MAF of 1.8%; � 5 being much less common (MAF<0.1%), in the population. Functional consequences predicted for PGx alleles in the Saudi population were found predominantly in CYP genes. In CYP3A5 we found the highest number (87.5%) of inactive alleles as a result of the frequently observed intronic splice site CYP3A5 � 3 variant. CYP4F2 showed decreased function alleles in 44.4% of individuals, whereas, in two other CYP genes (CYP2C9 and CYP2C19) reduced function alleles (inactive or decreased) were less common, being 20.6% and 10.1%, respectively. In other prominent PGx genes, allele function was much more conserved, with only 1.8%, 0.7% and 0.7% of NUDT15, TMPT and DPYD variants predicted to affect activity, respectively. In contrast, functionally, VKORC1 was highly polymorphic with 53.7% of Saudi individuals harboring variants predicted to result in decreased activity, whereas 31.3% carry variants leading to increased metabolic activity (Table 1 and S1 Table, Fig 1).
Based on genotypic data and predicted functional consequences of variant alleles we defined genotype-to-phenotype correlations (Fig 2). The phenotyping algorithms were derived from CPIC guidelines which were available only for CYP2C19, CYP2C9, CYP3A5, TPMT, NUDT15, and DPYD. Extensive metabolizer (EM) was the most frequent category for DPYD (98.7%), TPMT (97.8%) and NUDT15 (95.6%). EM status was also the highest (64.5% and 38.3%) for CYP2C9 and CYP2C19 although a significant number of remaining individuals (35.4% and 61.6%) are predicted to carry an altered drug metabolizer status for these two genes. Of all 62 previously reported, predicted to be pathogenic (based on a two-fold scoring) rare variants (MAF<1%), four (1 stop-gain, 2 frameshift and 2 missense variants) with an aggregated frequency of 0.67% were uniquely observed in Saudi individuals when compared with other populations (European, Finish, Hispanic, African, South Asian, East Asian, Ashkenazi Jews and Arabs). Two missense variants were only present in Arabs (Table 2 and S2 Table). Next, we identified 46 novel sequence alterations in seven of the eight PGx genes studied. They included 5 stop-gain, 5 splice site, 1 frameshift, and 35 missense variants with an ADME score of �84%. DPYD revealed the largest number (n = 19) of novel alterations, the most frequent being DPYD:p.Ile971Thr, having a MAF of 0.00055 (Table 3 and S3 Table).

Discussion
Inter-individual differences in drug efficacy drive current trends towards personalized pharmacotherapy targeting delivery of the right drug, at the right dose to the right patient. A standard dose of a given drug is not always safe, effective or economical in an individual patient [7,8]. Mining of large-scale NGS data is a very powerful tool for cataloging the range and frequency of genetic variation in populations [25]. We used whole exome and PGx gene panel NGS data to estimate pharmacogenetic diversity in the Saudi population, thus far poorly recorded in current databases, compared to many other ethnic groups.
Our analysis provides the most comprehensive overview of PGx variability (predicted to be clinically relevant), of 8 phase I and phase II enzymes, in the Saudi population, published to date. We found that 61.6% of the Saudi cohort carry actionable CYP2C19 alleles, which may be associated with an increased risk of major adverse cardiovascular events during antiplatelet therapy with clopidogrel. In this instance ADEs range from stent thrombosis in poor and intermediate metabolizers, to bleeding risk in rapid and ultrarapid metabolizers. This drug was prescribed to several thousand patients who were treated at King Faisal Specialist Hospital and Research Centre, Riyadh, Saudi Arabia (KFSHR&RC) last year alone. Similar to European,    [26] and decreased warfarin activity [12] respectively, were strongly represented in our study population. CYP4F2 acts as an important counterpart to VKORC1 in limiting excessive accumulation of vitamin K [27,28]. Inappropriate warfarin dosing underlies one of the most frequently reported adverse events, acute haemorrhages being one of the most common emergency visits in the US [29]. At KFSH&RC alone, warfarin is prescribed for several thousand patients every year. According to updated CPIC guidelines, genotypes of CYP2C9, VKORC1 and CYP4F2, should be considered together to estimate therapeutic warfarin dosing. One of the key factors strongly considered in dosing algorithms include ethnicity and population related genetic information. The majority of PGx data underpinning these guidelines arises from European, African American and East Asian ancestry [12]. Very little is known about pharmacogenetics in Arabs. Our study shows that the frequency of CYP2C9 � 2, � 3, and VKORC1 � 2 in the Saudi population is similar to that of Europeans [25]. Other CYP2C9 variants common in Africans and present in Europeans (e.g. CYP2C9 � 5, � 6, � 8, and � 11), that should be considered in warfarin dosing algorithms due to associated bleeding risk, show low occurrence in the Saudi population. Based on our findings, and subject to clinical validation, dosing recommendations for warfarin in Saudi patients should follow those for non-African ancestry, as recommended in CPIC guidelines [22]. However, studying the impact of a significantly higher frequency, of the functionally inactive CYP2C9 � 33 allele, on warfarin dosing in the Saudi population is strongly indicated. The vast majority of the Saudi population carries the CYP3A5 � 3 variant that results in a truncated mRNA with loss of protein expression [30]. Frequency of the � 3 allele varies extremely across human populations and is correlated with distance from the equator. Equatorial populations may experience shortage of water and a sodium retaining phenotype in hot climates [31]. Our findings show frequency of this allele in the Saudi population, to be similar to that in six other populations (Ashkenazi, European, American, Finish and both Asians) [24,25]. This gene catalyzes the metabolism of tacrolimus, a mainstay immunosuppressant. Patients with the CYP3A5 � 3 allele require the standard dose of this medication [13]. At KFSH&RC alone~4000 patients received tacrolimus last year, and 22.2% of these may be normal metabolizers (2.6%) or intermediate metabolizers (19.6.%), requiring an increased tacrolimus dose to achieve a successful outcome. Clinical validation of this would be required, particularly given relatedness of donors and recipients in a consanguineous population, where histoincompatibility may be less than observed elsewhere.
Genetic variation in TPMT and NUDT15 are strongly linked to the risk for adverse reactions, to thiopurines commonly used for treatment of malignant and non-malignant conditions [11]. The "normal" starting doses are generally high based on clinical trials which are enriched in wild-type individuals. Full doses are tailored for normal metabolizers and may cause acute toxicity in intermediate and poor metabolizers [32]. Thiopurine tolerance is highly correlated with genetic ancestry [33]. The functionally inactive TPMT � 3A allele is much less common in Saudi individuals relative to American, European and Ashkenazi populations [24,25]. CPIC guidelines recommend a customized dose of thiopurines in compound intermediate metabolizers (intermediate metabolizers in both TMPT and NUDT15 [11]. We identified 0.03% (n = 3) compound intermediate metabolizers in Saudi population. Genetic variation in DPYD is a strong predictor of adverse risk related to use of the chemotherapeutic agent fluorouracil, commonly used in the treatment of various malignancies. Many cases have been reported of severe toxicities or even lethal outcome due to the DPYD poor or null metabolizer phenotype [34]. In our study we identified 1.3% of Saudi individuals who carry either a functionally normal allele plus one null or one functionally decreased allele, and would be predicted to be intermediate metabolizers. Reduced doses of fluorouracil may be indicated for these individuals [10]. More importantly our study detected in the Saudi population, the DPYD rare pathogenic mutation (c.257C>T) that may be responsible for severe toxicity in heterozygous patients or lethality in homozygous cancer patients treated with fluoropyrimidines [35]. We found this variant to be significantly enriched in the Saudi population with approximately 1 in every 333 individuals heterozygous for this allele. This DPYD allele is also present in the Qatari population (0.3%) whereas it is very rare in other populations, with frequencies (relative to the Saudi population) <36-fold in Americans, <52-fold in Europeans, <99-fold in South Asians, and was absent in other compared populations ( Table 2 and S2 Table). Given the high rate of consanguinity (~60%) in Saudi Arabia, we can expect relative to outbred populations, a higher incidence of homozygotes for the DPYD (c.257C>T) mutation. Consanguinity increases the probability of a mate to be a carrier of the same recessive allele [36]. Thus, genotyping DPYD in the Saudi population may have greater clinical relevance. In most of the pharmacogenes screened we observed alleles shared with other Arabs [19,24,37], and some unique to the Saudi population. Amongst those shared with other Arabs, some were observed at significantly (p<0.05) different frequencies (S1 and S2 Tables).
Large-scale NGS data mining enables discovery of novel and rare pharmacogenetic alterations [3]. They are often population specific alleles and are not incorporated within current pharmacogenomic assays. Our study shows that such variants are present in the Saudi population, with computational algorithms predicting their functional significance in multiple instances. They may significantly add to knowledge of potentially actionable variants in ADME genes within the Saudi population and should be further investigated. Novel variants require experimental validation to test their functional effects in drug response [38]. Our study highlights the value of mining large NGS databases as a powerful tool, to improve knowledge of genomic variation within ADME genes, and stimulate their further investigation and eventual implementation in clinical practice. The data we present from one of the larger Middle Eastern countries, provides the most comprehensive overview of pharmacogenetic variants in Arabs, who to date are underrepresented in international genomic databases. We believe it will have both immediate and near-term clinical implications, expanding the application of pharmacogenetics and the practice of precision or individualized medicine in Arab patients.

Study limitations
The clinical impact of variants identified by this study remain in question as information from relevant clinical trials are limited. While PGx variants are predicted to be actionable in other populations, one cannot assume that these variants will ultimately have the same impact in the Saudi population without clinical verification. Another limitation of our study is the technical constraints of exome sequencing; non-coding regions and loci with high genomic complexity are poorly, or not covered at all. Structural changes and copy number variations which may be relevant are not reliably identified by whole exome or gene panel sequencing. Thus, we were not able to call star alleles with whole gene deletions, duplications or hybrids that are common in the assignment of CYP2D6 alleles. Accordingly, we did not include CYP2D6 in our analysis.
Furthermore, actionable variants located in non-coding regions CYP2C19 rs12248560, CYP3A5 rs776746, VKORC1 rs9934438, VKORC1 rs7294, DPYD rs67376798 were not covered by whole exome sequencing, our data for these being exclusively obtained from the PGx custom gene panel only.

Methods
Manuscript was based on access to fully anonymized data from Saudi Human Genome Project for which waiver of consents was granted by the IRB of King Faisal Specialist Hospital and Research Center. The dataset used for mining of pharmacogenomic variants comprised 9,961 exomes and 1,928 PGx custom gene panels (genes are listed in S4 Table), from unrelated Arab individuals sequenced by the Saudi Human Genome Program (SHGP) between 2015 and 2019, as part of a comprehensive investigation of rare diseases in the Saudi population [39,40]. We studied eight genes for which the Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines are curated (https://cpicpgx.org/guidelines/) and present on FDA labels (https://www.fda.gov.Drugs/ScienceReseach/ucm572698). CYP star allele assignment and their clinical function was derived from Pharmacogene Variation Consortium (https://www. pharmvar.org/genes/) and CPIC allele functional tables. Metabolizer types were inferred based on CPIC guidelines and the Pharmacogenomics Knowledgebase (PharmGKB) https://www. pharmgkb.org/ and they were defined as follows: ultrarapid metabolizer (UM), intermediate metabolizer (IM), extensive/normal metabolizer (EM), poor metabolizer (PM), rapid metabolizer (RM), IM to EM and PM to EM. Our method for Star allele calling was based upon using the Stargazer algorithm (v.1.0.8). This algorithm performs statistical haplotype phasing using Beagle [41] with reference samples from the 1000 Genomes Project [42]. The Beagle method is based on localized haplotype-cluster model, which is an empirical linkage disequilibrium model that can take the local structure in the data into consideration. The Beagle algorithm is accurate and runs fast due to the use of an EM-based algorithm that literately fits the best model to the data. Afterwards, the phased haplotypes computed by Beagle are then matched to publicly available star allele information, mostly in the PharmVar database (https://www. pharmvar.org) and PharmGKB (https://www.pharmgkb.org/). Finally, Stargazer reports the star allele findings in a tabular format along with prediction of the related metabolizer information. Frequencies of intronic and UTR variants were covered only by the PGx panel and their frequency was calculated based on the cohort of 1928 individuals. Variants with MAF <1% were defined as rare and genetic alterations with frequencies that exceeded the observed frequencies in other populations (European, Finish, Hispanic, African, South Asian, East Asian, Ashkenazi Jews and Arabs) by >20-fold were considered as being "Saudi-specific". A Chi-square test was used to determine the statistical difference for allele frequencies between different populations. A p-value less than 0.05 was considered significant. Next, we classified alleles as novel if they were not observed in: 1000 Genomes (phase3), gnomAD (v.3.1.1), Exac (v.0.3) and Kaviar (v.160204). Functional consequence of PGx rare Saudi-specific and novel variants was predicted using a two-fold approach. Any variants with a high IMPACT rating, such as frameshift indels or stop loss variants were considered to be deleterious [43,44]. We then applied the ADME-optimized framework that is an ensemble of deleteriousness prediction methods for predicting deleteriousness in pharmacogenes. We used 18 prediction algorithms to compute the ADME scores including CADD, SIFT, PolyPhen, LRT (likelihood ratio test), MutationAssessor, FATHMM, FATHMM-MKL, PROVEAN, VEST3, DANN, MetSVM, MetaLR, GERP++, SiPhy, PhyloP-vertebrate, PhyloP-mammalian, PhastCons-vertebrate, and PhastCons-mammalian. ADME scores larger than 84% were considered to affect pharmacogene functionality [45]. We used phenotypes generated from Stargazer for CYP2C19, CYP2C9, CYP3A5, DPYD, NUDT15 and TPMT to determine the percentage of individuals predicted to have actionable PGx variants. For VKORC1 (rs9934438) and CYP4F2 � 3 (rs2108622), individuals carrying heterozygous (CT) or homozygous (TT) and heterozygous (GA) or homozygous (AA), respectively were considered to have an actionable variant in those genes.
Supporting information S1